I know what you’re thinking: “Oh no, not another thing about coronavirus ugh.” But I’m happy to tell you that we’re looking at COVID’s RNA virus sibling, good ‘ol Orthomyxoviridae Influenza. I’ve done quite a few research papers in college about the flu, and while we all may be desensitized the wrath of the flu, or shall I say, “coronavirus’ shadow”, I find the common ailment facinating. I discovered the CDC has a whole archive of flu vaccination statistics from the past 10 years, and that’s where I discovered the ’vaccinations’ dataset, described below. While you think it may be easy to get the CDC to cough up case numbers per state in the year of the great swine flu (2009) this is not the case (puns intended). I am not proud of it, but the ‘flu_cases’ data set came from Wikipedia. Unfortunately, both the fact that the validity of these case numbers may be compromised, and the fact that the data is raw number of cases rather than a rate (divided by that state’s population during the flu season) has made the numbers, and therefore predictions somewhat precarious, I still predicted that states with the highest number of vaccinations would have the lowest number of cases, hospitalizations, and deaths. So please, peruse the wonderful stats I have discovered and displayed through the power of data science!
This data is about flu cases/hospitalizations/deaths per state and percent of people who recieved the flu vaccination 18+ years in 2009-2010 Swine Flu Epidemic.
setwd("/Users/emilyreed/Desktop")
flu_cases <- read.csv("Flu_cases.csv")
vaccinations <- read.csv("Vaccinations.csv")
Flu cases has 56 observations (1 per state, plus a few territories). “Cases”, “hospitalizations”, and “deaths” refers to the number of cases, hospital visits, and death due to influenza reported in that state for the 2009-2010 flu season. This data is NOT separated by age group.
# flu cases table turning cases and
# hospitalizations into numeric values for later
# calculations
flu_cases1 <- flu_cases %>% mutate_at("Hospitalizations",
as.numeric)
glimpse(flu_cases1)
## Rows: 56
## Columns: 4
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ cases <dbl> 2453, 1563, 8726, 154, 10545, 1321, 5491, 381, 54, 3…
## $ Hospitalizations <dbl> NA, 18, NA, NA, NA, 578, 766, NA, NA, NA, 860, NA, 3…
## $ Deaths <int> 19, 13, 152, 32, 657, 70, 35, 7, 1, 230, 81, 13, 23,…
Vaccinations is a dataset with 51 observations (one per state, plus District of Colubmbia). “Perc_vaccinated” refers to the percentage of each age group vaccinated in each state in the 2009-2010 flu season. There are three age groups: less than 6 months, 6 months to 17 years, and 18 years old and up.
# flu vaccinations table
glimpse(vaccinations)
## Rows: 51
## Columns: 4
## $ Names <chr> "Alabama", "Alaska", "Arizona", "Arka…
## $ perc_vaccinated.less6months <dbl> 45.8, 45.1, 45.9, 55.2, 45.4, 49.7, 5…
## $ perc_vaccinated.6months_17years <dbl> 53.2, 51.3, 53.1, 66.0, 53.3, 55.8, 6…
## $ perc_vaccinated.18years_up <dbl> 43.0, 42.9, 43.6, 49.4, 42.3, 47.8, 4…
Vaccinations dataset needs to be ’tidy’ed!
vaccinations2 <- vaccinations %>%
pivot_longer(2:4, names_to="age_group",values_to="percentage_vaccinated") %>% #pivoting to condense all perc_vaccinated rows
separate("age_group", into=c("percent", "age_group"), sep="\\.", convert=T ) %>% #separating perc_vaccinated.agegroup into age group and percent
select(-percent) %>% rename("State"="Names") #getting rid of pointless row "percent" and renaming "name" to "state" for joining and readability purposes
glimpse(vaccinations2)
## Rows: 153
## Columns: 3
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alaska", "Ala…
## $ age_group <chr> "less6months", "6months_17years", "18years_up",…
## $ percentage_vaccinated <dbl> 45.8, 53.2, 43.0, 45.1, 51.3, 42.9, 45.9, 53.1,…
# Lets join the data!
cases_vaccinations <- inner_join(flu_cases1, vaccinations2,
by = "State", convert = T)
glimpse(cases_vaccinations)
## Rows: 150
## Columns: 6
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alaska", "Ala…
## $ cases <dbl> 2453, 2453, 2453, 1563, 1563, 1563, 8726, 8726,…
## $ Hospitalizations <dbl> NA, NA, NA, 18, 18, 18, NA, NA, NA, NA, NA, NA,…
## $ Deaths <int> 19, 19, 19, 13, 13, 13, 152, 152, 152, 32, 32, …
## $ age_group <chr> "less6months", "6months_17years", "18years_up",…
## $ percentage_vaccinated <dbl> 45.8, 53.2, 43.0, 45.1, 51.3, 42.9, 45.9, 53.1,…
I performed an inner join, which allowed me to take only the State variables found both in flu cases and vaccinations data sets. I used an inner join because I only wanted to look at flu cases and vaccinations in the 50 formal United States, as well as avoid extra observations with a bunch of missing values. Using ‘inner_join’, I dropped the ‘Washington DC’ observation from the vaccinations dataset, and all the U.S territories (i.e. American Samoa, Guam, Puerto Rico,Northen Mariana Islands, U.S Virgin Islands ) from the flu_cases data set. The final merged data set, “cases_vaccinations” contains 6 variables of State, cases, hospitalizations, deaths, age_group, percentage_vaccinated, and 150 observations total (1 for each state at a particular age group)
Finding the average % vaccinated for each state across all age groups.
# Find mean % of vaccinations across all age groups
# for each state
cases_vaccinations <- cases_vaccinations %>% group_by(State) %>%
mutate(average_vaccinations = mean(percentage_vaccinated))
cases_vaccinations
## # A tibble: 150 x 7
## # Groups: State [50]
## State cases Hospitalizations Deaths age_group percentage_vacc…
## <chr> <dbl> <dbl> <int> <chr> <dbl>
## 1 Alab… 2453 NA 19 less6mon… 45.8
## 2 Alab… 2453 NA 19 6months_… 53.2
## 3 Alab… 2453 NA 19 18years_… 43
## 4 Alas… 1563 18 13 less6mon… 45.1
## 5 Alas… 1563 18 13 6months_… 51.3
## 6 Alas… 1563 18 13 18years_… 42.9
## 7 Ariz… 8726 NA 152 less6mon… 45.9
## 8 Ariz… 8726 NA 152 6months_… 53.1
## 9 Ariz… 8726 NA 152 18years_… 43.6
## 10 Arka… 154 NA 32 less6mon… 55.2
## # … with 140 more rows, and 1 more variable: average_vaccinations <dbl>
# find states with average vaccinations % that are
# above the National average vaccination %
national_average_vaccinated <- cases_vaccinations %>%
filter(!is.na(average_vaccinations)) %>% summarize(national_avg = mean(average_vaccinations)) %>%
summarize(mean(national_avg))
national_average_vaccinated
## # A tibble: 1 x 1
## `mean(national_avg)`
## <dbl>
## 1 51.6
cases_vaccinations %>% filter(average_vaccinations >
national_average_vaccinated) %>% select(State,
average_vaccinations) %>% summarize(avg_vaccination_percent = mean(average_vaccinations)) %>%
arrange(desc(avg_vaccination_percent))
## # A tibble: 21 x 2
## State avg_vaccination_percent
## <chr> <dbl>
## 1 Rhode Island 66.3
## 2 Hawaii 64.5
## 3 Vermont 62.3
## 4 Massachusetts 61.6
## 5 South Dakota 61.0
## 6 Maine 60.6
## 7 New Hampshire 58.9
## 8 Virginia 58.2
## 9 Iowa 57.9
## 10 Minnesota 57.4
## # … with 11 more rows
Let’s see which states had the most cases of flu. Which states had the fewest percent average vaccinated citizens?
# Which states had the most cases of flu
cases_vaccinations %>% group_by(State) %>% summarize(case = mean(cases)) %>%
arrange(desc(case))
## # A tibble: 50 x 2
## State case
## <chr> <dbl>
## 1 Pennsylvania 10940
## 2 California 10545
## 3 Wisconsin 9579
## 4 Arizona 8726
## 5 Texas 6128
## 6 Nevada 5516
## 7 Connecticut 5491
## 8 Florida 3676
## 9 Illinois 3387
## 10 New York 2738
## # … with 40 more rows
# Which states had the lowest average % vaccinated
# across all age groups
cases_vaccinations %>% group_by(State) %>% summarize(avg_vaccination_percent = mean(average_vaccinations)) %>%
arrange(avg_vaccination_percent)
## # A tibble: 50 x 2
## State avg_vaccination_percent
## <chr> <dbl>
## 1 Nevada 40.2
## 2 Idaho 43.7
## 3 Michigan 43.7
## 4 Florida 43.8
## 5 Georgia 43.8
## 6 Mississippi 44.3
## 7 Montana 44.3
## 8 Oregon 46.1
## 9 Alaska 46.4
## 10 Missouri 46.6
## # … with 40 more rows
# Do the states that had the most flu match with
# the states that had the lowest average %
# vaccinated? (Nope)
cases_vaccinations %>% group_by(State) %>% summarize(case_in_state = mean(cases),
avg_vaccination_state = (mean(average_vaccinations))) %>%
arrange(desc(case_in_state), avg_vaccination_state) %>%
select(State, case_in_state, avg_vaccination_state)
## # A tibble: 50 x 3
## State case_in_state avg_vaccination_state
## <chr> <dbl> <dbl>
## 1 Pennsylvania 10940 51.8
## 2 California 10545 47
## 3 Wisconsin 9579 48.9
## 4 Arizona 8726 47.5
## 5 Texas 6128 47.9
## 6 Nevada 5516 40.2
## 7 Connecticut 5491 53.3
## 8 Florida 3676 43.8
## 9 Illinois 3387 48.5
## 10 New York 2738 50.6
## # … with 40 more rows
Computing the mean and sd for national cases, hospitalizations, and deaths.
# Finding the national average of cases
national_average_cases <- cases_vaccinations %>% summarize(average_cases_US = mean(cases)) %>%
summarize(average_cases_US = mean(average_cases_US))
national_average_cases
# Finding the national sd of cases between states
national_sd_cases <- cases_vaccinations %>% summarize(sd_cases_US = mean(cases)) %>%
summarize(sd_cases = sd(sd_cases_US))
national_sd_cases
# Finding the national average of hospitalizations
national_average_hospitalizations <- cases_vaccinations %>%
filter(!is.na(Hospitalizations)) %>% summarize(average_hosp_US = mean(Hospitalizations)) %>%
summarize(average_hosp_US = mean(average_hosp_US))
national_average_hospitalizations
# Finding the national sd of hospitalizations
# between states
national_sd_hospitalizations <- cases_vaccinations %>%
filter(!is.na(Hospitalizations)) %>% summarize(sd_hosp_US = mean(Hospitalizations)) %>%
summarize(sd_hosp_US = sd(sd_hosp_US))
national_sd_hospitalizations
# Finding the national average of deaths (this data
# is obviously not very accurate, oop)
national_average_deaths <- cases_vaccinations %>% filter(!is.na(Deaths)) %>%
summarize(average_deaths_US = mean(Deaths)) %>%
summarize(average_deaths_US = mean(average_deaths_US))
national_average_deaths
# Finding the national sd of deaths between states
national_sd_deaths <- cases_vaccinations %>% filter(!is.na(Deaths)) %>%
summarize(sd_deaths_US = mean(Deaths)) %>% summarize(sd_deaths_US = sd(sd_deaths_US))
national_sd_deaths
cases_vaccinations %>% arrange(-Hospitalizations)
# make a table of all means/sd
table_mean_sd <- data.frame(national_average_cases,
national_sd_cases, national_average_hospitalizations,
national_sd_hospitalizations, national_average_deaths,
national_sd_deaths)
table_mean_sd %>% pivot_longer(1:6, names_to = "var",
values_to = "stat") %>% separate(var, into = c("stat_type",
"vars")) %>% pivot_wider(names_from = stat_type,
values_from = stat)
## # A tibble: 3 x 3
## vars average sd
## <chr> <dbl> <dbl>
## 1 cases 2264. 2656.
## 2 hosp 407. 333.
## 3 deaths 66.6 100.
Finding the quantile, min, max for all numeric variables.
case_quantile <- cases_vaccinations$cases
cases_quantile <- quantile(case_quantile, c(0.25, 0.5,
0.75))
min_cases <- min(case_quantile)
max_cases <- max(case_quantile)
percent_vaccinated <- cases_vaccinations$percentage_vaccinated
vax_quantile <- quantile(percent_vaccinated, c(0.25,
0.5, 0.75), na.rm = T)
min_vax <- min(percent_vaccinated)
max_vax <- max(percent_vaccinated)
table_quantile <- data.frame(cases_quantile, min_cases,
max_cases, vax_quantile, min_vax, max_vax)
# showing table of quantile, min and max
table_quantile %>% as.data.frame %>% rownames_to_column("quantile") %>%
pivot_longer(c(cases_quantile, vax_quantile), names_to = "quantile_var",
values_to = "quantile_number") %>% separate(quantile_var,
into = c("quantile_var", "NA"), sep = "_", ) %>%
select(-"NA") %>% group_by(quantile_var) %>% pivot_longer(2:5,
names_to = "min_max", values_to = "min_max_values") %>%
separate(min_max, into = c("min_max", "type")) %>%
filter(quantile_var == type) %>% select(-type) %>%
pivot_wider(names_from = quantile, values_from = quantile_number) %>%
pivot_wider(names_from = min_max, values_from = min_max_values)
## # A tibble: 2 x 6
## # Groups: quantile_var [2]
## quantile_var `25%` `50%` `75%` min max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cases 722 1306. 2221 54 10940
## 2 vax 45.9 49.8 55.2 38.1 82.1
The national average vaccination percentage was 51% for the 2009-2010 swine flu season. It would be interesting to see if the seasons that the national average was higher also had lower number of cases. The average and standard deviation for numeric variables “Cases”, “Hospitalizations”, and “Deaths” between each state, and are displayed in table format. The quantile cutoff values for cases and average percent vaccinated for all three age groups per state were calculated, as well as the minimum and maximum number of cases and vaccination percentage across all 50 states.
# make unique ID
df <- unite(cases_vaccinations, State, age_group, col = "State_AgeGroup",
remove = F) %>% ungroup() %>% select(-State) %>%
select(-age_group)
corheatmat <- df %>% na.omit %>% select_if(is.numeric) %>%
cor(use = "pair")
tidycor <- corheatmat %>% as.data.frame %>% rownames_to_column("var1") %>%
pivot_longer(-1, names_to = "var2", values_to = "correlation")
tidycor %>% ggplot(aes(var1, var2, fill = correlation)) +
geom_tile() + scale_fill_gradient2(low = "red",
mid = "white", high = "blue") + geom_text(aes(label = round(correlation,
2)), color = "black", size = 4) + ggtitle("Correlation Heat Map of Numeric Variables") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_fixed()
This correlation heat map shows what we would hopefully expect from this merged dataset. The “average_vaccinations” variable, which describes the average vaccination rate across all age groups for each state, is negatively correlated with deaths and hospitalizations, and very loosely coorelated (.02) with cases (probably by lack of data, or spurious correlation). We would want a negative correlation between these variables because it indicates that the more people that are vaccinated in that state, the fewer number of cases, hospitalizations, and deaths are recorded in that state. We also see a bunch of blue in the middle of the plot, which is also what we would expect. It is intuitive that the number of cases would naturally be postively correlated with an increase in hospitalizations and deaths.
cases_vaccinations %>% group_by(State) %>% filter(!is.na(average_vaccinations)) %>%
mutate(avg_vaccination_percent = mean(average_vaccinations)) %>%
select(State, avg_vaccination_percent, cases, age_group) %>%
filter(cases > national_average_cases) %>% ggplot() +
geom_point(aes(avg_vaccination_percent, cases,
col = State)) + ggtitle("Average % Vaccinations and Cases for State Cases Above National Average") +
xlab("Average Vaccination Percent") + ylab("Number of Cases")
This scatterplot depicts the relationship between number of cases and average vaccination percentage of all age groups for all the states that are above the national average number of cases (2264.32 cases). There does not seem to be any correlation depicted in this graph between the two variables (we would expect a negative correlation). Since that number of cases is not a rate per person per state, but rather the raw numbers, it makes sense that Texas and California are above the national average number of cases just by shear population size. It is interesting, however, that Pennsylvania has a high average vaccination percentage (51.833%), and yet a high case number (10940 cases). While this outlier seems to contridict graph 1, the increase in cases could be explained by the cold weather experienced during flu season, driving people in Pennsylvania into close quarters to spread the disease easier. This could also explain the high number of cases in New York and Connecticut despite a relatively high average vaccination percent.
# determining if states are in 1st, 2nd or 3rd, or
# 4th quantile for cases for bar graph below
quantile_lowest_cases <- quantile(case_quantile, 0.25)
quantile_highest_cases <- quantile(case_quantile, 0.75)
cases_vaccinations <- cases_vaccinations %>% filter(!is.na(age_group)) %>%
mutate(states_quantile = case_when(cases <= quantile_lowest_cases ~
"low", cases > quantile_lowest_cases & cases <=
quantile_highest_cases ~ "med", cases > quantile_highest_cases ~
"high"))
p1 = cases_vaccinations %>% group_by(State) %>% filter(!is.na(percentage_vaccinated)) %>%
ggplot(aes(x = age_group, y = percentage_vaccinated,
fill = age_group)) + geom_bar(stat = "summary") +
scale_fill_brewer(palette = "Dark2") + facet_wrap(states_quantile ~
State) + geom_hline(yintercept = 51.62267, linetype = "longdash") +
ggtitle("% Vaccinated by Age Group by State") +
theme(plot.title = (element_text(size = 20, face = "bold",
hjust = 0.5)), axis.title.x = element_blank(),
axis.text.x = element_blank(), strip.text = element_text(size = 2)) +
ylab("Percent Vaccinated")
p2 = cases_vaccinations %>% group_by(State) %>% filter(!is.na(percentage_vaccinated)) %>%
ggplot(aes(x = age_group, y = percentage_vaccinated,
fill = age_group)) + facet_wrap(~State) + geom_rect(aes(fill = states_quantile),
xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
theme_minimal()
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
gtable_select <- function(x, ...) {
matches <- c(...)
x$layout <- x$layout[matches, , drop = FALSE]
x$grobs <- x$grobs[matches]
x
}
panels <- grepl(pattern = "panel", g2$layout$name)
strips <- grepl(pattern = "strip_t", g2$layout$name)
stript <- grepl(pattern = "strip-t", g2$layout$name)
g2$layout$t[panels] <- g2$layout$t[panels] - 1
g2$layout$b[panels] <- g2$layout$b[panels] - 1
new_strips <- gtable_select(g2, panels | strips | stript)
grid.newpage()
grid.draw(new_strips)
gtable_stack <- function(g1, g2) {
g1$grobs <- c(g1$grobs, g2$grobs)
g1$layout <- transform(g1$layout, z = z - max(z),
name = "g2")
g1$layout <- rbind(g1$layout, g2$layout)
g1
}
# BIG thanks to source code:
# https://stackoverflow.com/questions/19440069/ggplot2-facet-wrap-strip-color-based-on-variable-in-data-set?answertab=votes#tab-top
new_plot <- gtable_stack(g1, new_strips)
grid.newpage()
grid.draw(new_plot)
Hold onto your britches for this doozy. These faceted bar graphs are showing vaccination percentage for each recorded age group in each state, shown in the legend. The dashed line indicates the national average vaccinated percent (51.62267%). The facet labels are colored according to quantile distinction for number of cases in that state. A pink label indicates that the state is considered “high”, or above the 3rd quartile. The blue label indicates “medium”, or between the 1st and 3rd quantiles (between >25% and 75%). The green label indicates “low”, or in the 1st quartile for cases (25% and below).
Bars above the dashed line symbolize for that age group in that specific state, their vaccination percentage rate is above the overall national average percentage rate. It looks like the highest bar for percent vaccinations for the nation as a whole is the 6 months to 17 years age group. This is probably due to parental influence. Looking at the pink labeled states, it looks as though almost all these states’ purple “18 years+” group is less than the national average for vaccinations. It is interesting that Delaware, a green labeled group, looks as though their vaccination percentage across all groups is low, yet with a few number of cases. This may be due to Deleware’s small population, or maybe because the Delewarians are hermits that never leave the house ever, neither to get vaccinated or to catch the virus.
# PAM creating a data frame to examine clusters
df2 <- unite(cases_vaccinations, State, age_group,
col = "State_AgeGroup", remove = F) %>% ungroup() %>%
mutate_if(is.character, as.factor) %>% select(-State,
-age_group, -states_quantile) %>% na.omit
# choosing clusters
sil_width <- vector()
for (i in 2:50) {
pam_fit <- pam(df2, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:50, y = sil_width)) +
scale_x_continuous(name = "k", breaks = 1:50) +
theme(axis.text.x = element_text(size = 5)) + geom_point(aes(x = 8,
y = 0.63), color = "red")
# clustering using PAM
pam1 <- df2 %>% select(-1) %>% scale %>% pam(k = 8)
pam1
## Medoids:
## ID cases Hospitalizations Deaths percentage_vaccinated
## [1,] 28 -0.3681047 -1.21289162 -0.5407067 -1.0827275
## [2,] 55 -0.3444917 -0.31923551 0.1087812 -0.3968249
## [3,] 7 3.5936327 1.09597415 -0.1823685 -0.2076104
## [4,] 10 -0.3235023 1.38267611 0.8478536 -1.1536829
## [5,] 61 -0.1468421 1.36132596 -0.4735183 -0.3140435
## [6,] 52 0.6113977 0.04676699 -0.4511221 0.7502881
## [7,] 31 -0.8324938 -1.11834097 -0.6302912 0.1826446
## [8,] 37 1.1859808 1.53212713 3.6473704 -0.5623876
## average_vaccinations
## [1,] -1.38751988
## [2,] -0.50677678
## [3,] 0.02271758
## [4,] -1.46615765
## [5,] 0.04368765
## [6,] 1.22849682
## [7,] 0.30581357
## [8,] -0.39668389
## Clustering vector:
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 1 2 1 5 6 5 6 6 6 5 5 5 6 6 6 1 1 1 7 7 7 2 2 2 8 8
## [39] 8 2 2 2 7 7 7 2 2 2 7 6 7 6 6 6 2 2 2 7 6 7 5 5 5
## Objective function:
## build swap
## 0.8664379 0.8638319
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
# looking at clusters
df2 %>% mutate(cluster = pam1$clustering) %>% arrange(cluster)
## # A tibble: 63 x 7
## State_AgeGroup cases Hospitalizations Deaths percentage_vacc…
## <fct> <dbl> <dbl> <int> <dbl>
## 1 Alaska_less6m… 1563 18 13 45.1
## 2 Alaska_6month… 1563 18 13 51.3
## 3 Alaska_18year… 1563 18 13 42.9
## 4 Idaho_less6mo… 1171 389 23 42.9
## 5 Idaho_18years… 1171 389 23 41.4
## 6 Montana_less6… 961 9 19 44
## 7 Montana_6mont… 961 9 19 45.6
## 8 Montana_18yea… 961 9 19 43.4
## 9 Colorado_less… 1321 578 70 49.7
## 10 Colorado_6mon… 1321 578 70 55.8
## # … with 53 more rows, and 2 more variables: average_vaccinations <dbl>,
## # cluster <int>
# looking at final mediods
df2 %>% slice(pam1$id.med)
## # A tibble: 8 x 6
## State_AgeGroup cases Hospitalizations Deaths percentage_vacc… average_vaccina…
## <fct> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Montana_less6… 961 9 19 44 44.3
## 2 Utah_less6mon… 988 302 48 49.8 49.9
## 3 Connecticut_l… 5491 766 35 51.4 53.3
## 4 Georgia_less6… 1012 860 81 43.4 43.8
## 5 West Virginia… 1214 853 22 50.5 53.4
## 6 South Dakota_… 2081 422 23 59.5 61.0
## 7 Nebraska_less… 430 40 15 54.7 55.1
## 8 New York_less… 2738 909 206 48.4 50.6
Its interesting that all the final mediods are in the vaccination age group of ‘less than 6 months.’
# graphing clusters
df3 <- df2 %>% select(-1) %>% mutate(cluster = as.factor(pam1$clustering))
df3 %>% ggpairs(columns = 1:6, aes(color = cluster),
upper = list(continuous = wrap("cor", size = 9))) +
theme_grey(base_size = 30)
I chose 8 clusters because it was a peak in the silhouette peak graph(shown by a red dot). The best number of clusters, 21, was too large to compute and would not have revealed anything significant about the clusters because there would be so little observations in each cluster. The next best cluster peak number was 14, but the graphs were too busy and confusing to interpret. The next best cluster number was 8, which is what was used to generate the graphs above.
Cluster 1 (red) is probably clustered most strongly by number of deaths, with a high peak of around 200. Cluster 2 (yellow) is tightly clustered at three peaks for average vaccinations, having the highest number of state_AgeGroups in its cluster. Cluster 5 and 6 aslo cluster tightly in average percent vaccinations, and weaker in deaths. Cluster 7 (purple) exhibits a cluster of state_AgeGroups with a high number of cases, hospitalization, and deaths. As expected from our correlation heat map, as well as from the correlations shown in the graph above, cluster 7 exhibits a low percentage vaccinated and a very wide distribution of average vaccinations. Overall, the clustering shown in the graph above does not reveal any striking cluster groups, but it overall reveals that despite differences in case numbers and vaccination % accross the county, there is not a clear group of states_ageGroups that are preventing the flu the best (with very low case numbers and high vaccination percent), but cluster 7 is the closest to earning this title.
gif image source: media.giphy